url <- 'https://raw.githubusercontent.com/eitanaka/DATS6101_Final_Project_Team2/main/data_set/geo_socio_health_df.csv'
master_df <- read_csv(url)
master_df <- master_df[,-1]
# Rename selected columns to shorter, more readable names. For each column, we match the original name with colnames() and assign the replacement.
colnames(master_df)[colnames(master_df) == "MT_Never Married"] <- "mt.nev.mar"
colnames(master_df)[colnames(master_df) == "MT_Now married"] <- "mt.now.mar"
colnames(master_df)[colnames(master_df) == "Total Population"] <- "tot.pop"
colnames(master_df)[colnames(master_df) == "EA_Less than high school graduate"] <- "ea.less.hs.deg"
colnames(master_df)[colnames(master_df) == "EA_High school graduate"] <- "ea.hs.deg"
colnames(master_df)[colnames(master_df) == "EA_college or associate's degree"] <- "ea.col.ass.deg"
colnames(master_df)[colnames(master_df) == "EA_Bachelor's degree"] <- "ea.ba.deg"
colnames(master_df)[colnames(master_df) == "EA_Graduate or professional degree"] <- "ea.grad.prof.deg"
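The repeated colnames() pattern can also be expressed once with a named lookup vector. A base-R sketch on toy data, showing only two of the mappings (the toy `df` and `rename_map` names are ours):

```r
# Map old column names to new ones with a named lookup vector (base R)
rename_map <- c("Total Population" = "tot.pop",
                "MT_Never Married" = "mt.nev.mar")

df <- data.frame(`Total Population` = c(100, 200),
                 `MT_Never Married` = c(30, 40), check.names = FALSE)
hits <- names(df) %in% names(rename_map)
names(df)[hits] <- rename_map[names(df)[hits]]
names(df)  # "tot.pop" "mt.nev.mar"
```

This keeps all the renames in one place, so adding or changing a mapping is a one-line edit.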
# Create a new data frame containing only the numeric variables
numeric_vars <- sapply(master_df, is.numeric)
num_df <- master_df[, numeric_vars]

Introduction

In Project 1, we investigated how health risk behaviors are associated with health outcomes and health status. For this project we draw on 11 different datasets, all imported from the Google Drive folder where they were uploaded. We now examine whether lifestyle, socioeconomic factors, and COVID-19 had a significant influence on mental health in the United States. Census tracts serve as the unit of analysis across all datasets. The goal of this research is to identify which factors significantly impacted mental health and well-being in the United States in 2020.

Data sets

We make use of ten different datasets, along with the combined dataset created in our previous data science project.

Research / SMART Questions

SQ1: What factors do we observe to be highly associated with depression and poor mental health rates?

SQ2: Using those factors, how accurately can we infer depression and poor mental health rates in a given tract?

EDA

In this project, we perform EDA on a dataset containing health and socioeconomic factors for census tracts across the United States. The purpose of the analysis is to investigate the relationship between these factors and two health outcomes: the depression rate and the poor mental health rate. We check data quality, including missing values, outliers, and inconsistencies, and then examine the distributions of the variables and their relationships using visualizations such as histograms, scatter plots, and correlation matrices. The EDA reveals that some variables are highly correlated with each other, indicating the presence of multicollinearity. Overall, the EDA helps us understand the data, identify potential issues, and guide our subsequent analysis. The insights gained have important implications for public health policy and future research.

Basic Information

The dataset is a table with 60,962 rows and 57 columns. The columns represent a variety of variables, including health indicators, demographic characteristics, and economic indicators. All of the retained variables are numeric, since non-numeric columns were excluded when num_df was constructed; missing values are handled in a later step. The dataset is a comprehensive collection of health and demographic information for a large population.
str(num_df)
## tibble [60,962 × 57] (S3: tbl_df/tbl/data.frame)
##  $ ACCESS2                           : num [1:60962] 15 19.9 12.5 18 19.6 16.7 16.5 21.1 12.7 17.8 ...
##  $ BINGE                             : num [1:60962] 15.8 14.1 15.2 14.9 14.9 15.5 14.9 12.5 16.6 14.9 ...
##  $ CHECKUP                           : num [1:60962] 74.1 76.5 75.5 74.6 74 74.3 75.2 78.3 73.9 75.5 ...
##  $ DEPRESSION                        : num [1:60962] 25.9 23.9 24 26.9 28.5 26.2 25.9 24.5 24.6 25.9 ...
##  $ DIABETES                          : num [1:60962] 10.7 13.4 10.2 12.8 12.8 11.3 12.8 17.4 9.7 13.8 ...
##  $ LPA                               : num [1:60962] 26.4 32.1 23 31.1 33.4 28.2 29.8 37.2 22.8 29.6 ...
##  $ MHLTH                             : num [1:60962] 16.4 17.3 14 17.9 19.1 17 17 17.7 15.1 16.9 ...
##  $ OBESITY                           : num [1:60962] 37 43.9 33.2 41.1 41 37.9 40.3 46.1 35.7 36.2 ...
##  $ PHLTH                             : num [1:60962] 11.3 12.1 9.8 13.5 14.5 11.6 12.8 15.4 9.8 13 ...
##  $ SLEEP                             : num [1:60962] 36.9 43.4 33.4 39.5 39.8 38.1 39.2 43.8 35.8 37.6 ...
##  $ STROKE                            : num [1:60962] 3.1 3.7 3 3.8 4 3.3 3.8 5.3 2.7 4 ...
##  $ mt.nev.mar                        : num [1:60962] 23.7 39.5 20.9 27.5 36.9 ...
##  $ mt.now.mar                        : num [1:60962] 54.8 31.8 63.8 52.9 37 ...
##  $ MT_Divorces                       : num [1:60962] 11.97 18.56 10.09 9.62 18.61 ...
##  $ MT_Separated                      : num [1:60962] 1.39 2.62 0 1.17 1.61 ...
##  $ MT_Widowed                        : num [1:60962] 8.11 7.54 5.27 8.87 5.9 ...
##  $ ea.less.hs.deg                    : num [1:60962] 14.29 10.59 6.23 12.86 17.91 ...
##  $ ea.hs.deg                         : num [1:60962] 33.8 50.3 28.1 31.2 38.4 ...
##  $ ea.col.ass.deg                    : num [1:60962] 27 23.2 26.4 34.3 28 ...
##  $ ea.ba.deg                         : num [1:60962] 15.4 12.9 22.8 12.2 10.8 ...
##  $ ea.grad.prof.deg                  : num [1:60962] 9.4 2.97 16.43 9.5 4.93 ...
##  $ MI_Estimate                       : num [1:60962] 26504 26173 37529 20669 24181 ...
##  $ tot.pop                           : num [1:60962] 1941 1757 3539 3536 3562 ...
##  $ CT_<10                            : num [1:60962] 17.68 5.65 13.98 21.38 19.81 ...
##  $ CT_10-14                          : num [1:60962] 27.6 17.54 8.91 13.07 12.08 ...
##  $ CT_15-19                          : num [1:60962] 7.02 3.48 13.78 16.25 15.53 ...
##  $ CT_20-24                          : num [1:60962] 14.29 7.68 14.37 10.42 20.91 ...
##  $ CT_25-29                          : num [1:60962] 5.57 0.87 8.26 5.39 6.07 ...
##  $ CT_30-34                          : num [1:60962] 18.8 39.9 22.1 18.6 16.1 ...
##  $ CT_35-44                          : num [1:60962] 7.02 14.2 7.87 7.51 3.59 ...
##  $ CT_45-59                          : num [1:60962] 1.69 8.41 3.64 3.27 4.55 ...
##  $ CT_>60                            : num [1:60962] 0.363 2.319 7.087 4.152 1.38 ...
##  $ ES_Total_labor_force              : num [1:60962] 54.7 50.3 54.5 44.9 58.9 ...
##  $ ES_Civilian_labor_force           : num [1:60962] 54.7 49.4 54.2 43.5 58.3 ...
##  $ ES_Civilian_labor_force_employed  : num [1:60962] 53.5 47.4 52.9 42.3 53.5 ...
##  $ ES_Civilian_labor_force_unemployed: num [1:60962] 1.16 2 1.27 1.21 4.78 ...
##  $ ES_Armed_Forces                   : num [1:60962] 0 0.828 0.327 1.423 0.56 ...
##  $ ES_Not_in_labor_force             : num [1:60962] 45.3 49.7 45.5 55.1 41.1 ...
##  $ land                              : num [1:60962] 3.79 1.29 2.46 3.1 8.65 ...
##  $ urban                             : num [1:60962] 1594 2170 4386 3595 2505 ...
##  $ poverty                           : num [1:60962] 11.34 17.88 2.85 21.58 30.5 ...
##  $ no.ins                            : num [1:60962] 9.26 9.37 2.82 14.87 16.23 ...
##  $ disab                             : num [1:60962] 17.6 17.1 19.6 25.7 24.7 ...
##  $ no.comp                           : num [1:60962] 11.23 17.36 7.6 8.45 9.54 ...
##  $ broad&comp                        : num [1:60962] 80.9 79.2 86 87.6 85.1 ...
##  $ no.eng                            : num [1:60962] 1.83 0.56 0.98 0.68 0.99 0 0 0 0 0 ...
##  $ sing.mom                          : num [1:60962] 14.12 17.25 7.69 9.38 27.48 ...
##  $ live.alone                        : num [1:60962] 22.9 37.3 28.4 18.7 28.5 ...
##  $ pub.assist                        : num [1:60962] 0 1.11 1.22 0.68 4.51 2.16 1.27 0.45 1.47 0 ...
##  $ no.phone                          : num [1:60962] 2.88 2.36 2.5 0.45 0.78 0.69 0.49 4.91 0.88 0.71 ...
##  $ no.plumb                          : num [1:60962] 0 2 0.86 3.47 0 ...
##  $ married.kid                       : num [1:60962] 36 49.2 36.3 43.6 50.1 ...
##  $ hhd.no.comp                       : num [1:60962] 18.3 27.8 10.4 11.6 13.1 ...
##  $ hhd.only.phone                    : num [1:60962] 6.01 5.7 4.27 5.41 3.59 ...
##  $ hhd.no.int                        : num [1:60962] 26.5 30.5 17.5 17.2 17.7 ...
##  $ hhd.broad                         : num [1:60962] 59.7 55.1 74.6 65.4 66 ...
##  $ index_apr20                       : num [1:60962] 0.913 0.913 0.913 0.913 0.913 ...

The table displays various health and socio-economic indicators for a population. The mean values for several health indicators, such as depression and obesity, are high, indicating potential health concerns in the population. Education levels vary, with a mean of 12.2% having less than a high school degree and 28.8% having some college or an associate’s degree. The poverty rate is 15.3%, and 9.1% of the population does not have health insurance. Additionally, 13.3% of households do not have a computer and 28.1% of individuals live alone.

summary(num_df)
##     ACCESS2         BINGE         CHECKUP       DEPRESSION      DIABETES   
##  Min.   : 2.6   Min.   : 2.7   Min.   :49.7   Min.   : 8.5   Min.   : 0.6  
##  1st Qu.: 9.8   1st Qu.:14.7   1st Qu.:70.6   1st Qu.:17.8   1st Qu.: 8.5  
##  Median :13.3   Median :16.7   Median :74.8   Median :20.3   Median :10.3  
##  Mean   :15.5   Mean   :16.8   Mean   :74.0   Mean   :20.4   Mean   :10.9  
##  3rd Qu.:19.2   3rd Qu.:18.7   3rd Qu.:77.7   3rd Qu.:22.9   3rd Qu.:12.6  
##  Max.   :64.9   Max.   :36.6   Max.   :90.8   Max.   :37.8   Max.   :42.2  
##                                                                            
##       LPA           MHLTH         OBESITY         PHLTH          SLEEP     
##  Min.   : 7.8   Min.   : 6.1   Min.   :11.5   Min.   : 2.3   Min.   :19.8  
##  1st Qu.:18.7   1st Qu.:13.1   1st Qu.:28.0   1st Qu.: 8.4   1st Qu.:30.7  
##  Median :23.7   Median :15.0   Median :33.1   Median :10.3   Median :33.5  
##  Mean   :24.6   Mean   :15.1   Mean   :33.0   Mean   :10.8   Mean   :34.0  
##  3rd Qu.:29.5   3rd Qu.:16.9   3rd Qu.:37.7   3rd Qu.:12.7   3rd Qu.:36.7  
##  Max.   :63.7   Max.   :33.0   Max.   :63.9   Max.   :31.3   Max.   :54.4  
##                                                                            
##      STROKE        mt.nev.mar      mt.now.mar     MT_Divorces     MT_Separated 
##  Min.   : 0.20   Min.   :  0.0   Min.   :  0.0   Min.   :  0.0   Min.   : 0.0  
##  1st Qu.: 2.40   1st Qu.: 24.6   1st Qu.: 37.7   1st Qu.:  7.9   1st Qu.: 0.6  
##  Median : 3.00   Median : 31.5   Median : 48.2   Median : 10.7   Median : 1.5  
##  Mean   : 3.19   Mean   : 34.2   Mean   : 46.7   Mean   : 11.2   Mean   : 1.9  
##  3rd Qu.: 3.70   3rd Qu.: 41.4   3rd Qu.: 57.1   3rd Qu.: 13.9   3rd Qu.: 2.7  
##  Max.   :20.50   Max.   :100.0   Max.   :100.0   Max.   :100.0   Max.   :41.8  
##                  NA's   :17      NA's   :17      NA's   :17      NA's   :17    
##    MT_Widowed   ea.less.hs.deg   ea.hs.deg    ea.col.ass.deg    ea.ba.deg    
##  Min.   : 0.0   Min.   : 0.0   Min.   : 0.0   Min.   :  0.0   Min.   :  0.0  
##  1st Qu.: 3.8   1st Qu.: 4.9   1st Qu.:19.6   1st Qu.: 23.4   1st Qu.: 10.8  
##  Median : 5.6   Median : 9.4   Median :28.4   Median : 29.1   Median : 17.2  
##  Mean   : 6.1   Mean   :12.2   Mean   :27.8   Mean   : 28.8   Mean   : 19.1  
##  3rd Qu.: 7.7   3rd Qu.:16.6   3rd Qu.:36.1   3rd Qu.: 34.5   3rd Qu.: 26.3  
##  Max.   :57.9   Max.   :83.0   Max.   :88.8   Max.   :100.0   Max.   :100.0  
##  NA's   :17     NA's   :25     NA's   :25     NA's   :25      NA's   :25     
##  ea.grad.prof.deg  MI_Estimate        tot.pop          CT_<10     
##  Min.   :  0.0    Min.   :  2499   Min.   :    0   Min.   :  0.0  
##  1st Qu.:  4.7    1st Qu.: 25068   1st Qu.: 2805   1st Qu.:  6.0  
##  Median :  8.7    Median : 31500   Median : 3868   Median : 10.5  
##  Mean   : 12.1    Mean   : 34406   Mean   : 4006   Mean   : 13.5  
##  3rd Qu.: 16.4    3rd Qu.: 40680   3rd Qu.: 5037   3rd Qu.: 17.4  
##  Max.   :100.0    Max.   :204750   Max.   :39373   Max.   :100.0  
##  NA's   :25       NA's   :71                       NA's   :93     
##     CT_10-14        CT_15-19        CT_20-24        CT_25-29       CT_30-34    
##  Min.   :  0.0   Min.   :  0.0   Min.   :  0.0   Min.   : 0.0   Min.   :  0.0  
##  1st Qu.:  7.9   1st Qu.:  9.8   1st Qu.:  9.2   1st Qu.: 3.2   1st Qu.:  8.2  
##  Median : 12.2   Median : 14.3   Median : 13.5   Median : 5.7   Median : 12.7  
##  Mean   : 13.6   Mean   : 15.2   Mean   : 14.1   Mean   : 6.3   Mean   : 13.3  
##  3rd Qu.: 17.9   3rd Qu.: 19.7   3rd Qu.: 18.3   3rd Qu.: 8.7   3rd Qu.: 17.6  
##  Max.   :100.0   Max.   :100.0   Max.   :100.0   Max.   :64.6   Max.   :100.0  
##  NA's   :93      NA's   :93      NA's   :93      NA's   :93     NA's   :93     
##     CT_35-44        CT_45-59         CT_>60      ES_Total_labor_force
##  Min.   :  0.0   Min.   :  0.0   Min.   :  0.0   Min.   :  0.0       
##  1st Qu.:  3.1   1st Qu.:  3.4   1st Qu.:  3.5   1st Qu.: 57.3       
##  Median :  6.0   Median :  6.9   Median :  7.0   Median : 63.7       
##  Mean   :  6.8   Mean   :  8.0   Mean   :  9.3   Mean   : 62.7       
##  3rd Qu.:  9.6   3rd Qu.: 11.3   3rd Qu.: 12.6   3rd Qu.: 69.3       
##  Max.   :100.0   Max.   :100.0   Max.   :100.0   Max.   :100.0       
##  NA's   :93      NA's   :93      NA's   :93      NA's   :17          
##  ES_Civilian_labor_force ES_Civilian_labor_force_employed
##  Min.   :  0.0           Min.   :  0.0                   
##  1st Qu.: 57.0           1st Qu.: 53.1                   
##  Median : 63.4           Median : 59.9                   
##  Mean   : 62.2           Mean   : 58.7                   
##  3rd Qu.: 69.0           3rd Qu.: 65.6                   
##  Max.   :100.0           Max.   :100.0                   
##  NA's   :17              NA's   :17                      
##  ES_Civilian_labor_force_unemployed ES_Armed_Forces ES_Not_in_labor_force
##  Min.   : 0.0                       Min.   : 0.0    Min.   :  0.0        
##  1st Qu.: 1.8                       1st Qu.: 0.0    1st Qu.: 30.7        
##  Median : 3.0                       Median : 0.0    Median : 36.3        
##  Mean   : 3.6                       Mean   : 0.4    Mean   : 37.3        
##  3rd Qu.: 4.6                       3rd Qu.: 0.0    3rd Qu.: 42.7        
##  Max.   :33.4                       Max.   :99.4    Max.   :100.0        
##  NA's   :17                         NA's   :17      NA's   :17           
##       land           urban          poverty          no.ins         disab      
##  Min.   :    0   Min.   :    0   Min.   :  0.0   Min.   : 0.0   Min.   :  0.0  
##  1st Qu.:    1   1st Qu.:    3   1st Qu.:  6.6   1st Qu.: 4.1   1st Qu.:  9.3  
##  Median :    2   Median : 2989   Median : 12.1   Median : 7.3   Median : 12.6  
##  Mean   :   47   Mean   : 2793   Mean   : 15.3   Mean   : 9.1   Mean   : 13.5  
##  3rd Qu.:   10   3rd Qu.: 4368   3rd Qu.: 20.9   3rd Qu.:12.3   3rd Qu.: 16.7  
##  Max.   :85426   Max.   :31777   Max.   :100.0   Max.   :74.7   Max.   :100.0  
##  NA's   :18      NA's   :18      NA's   :146     NA's   :29     NA's   :111    
##     no.comp        broad&comp        no.eng         sing.mom    
##  Min.   :  0.0   Min.   :  0.0   Min.   :  0.0   Min.   :  0.0  
##  1st Qu.:  3.4   1st Qu.: 76.4   1st Qu.:  0.0   1st Qu.:  7.3  
##  Median :  6.8   Median : 85.0   Median :  1.6   Median : 11.3  
##  Mean   :  8.5   Mean   : 82.6   Mean   :  4.6   Mean   : 13.3  
##  3rd Qu.: 11.6   3rd Qu.: 91.6   3rd Qu.:  5.4   3rd Qu.: 17.3  
##  Max.   :100.0   Max.   :100.0   Max.   :100.0   Max.   :100.0  
##  NA's   :164     NA's   :164     NA's   :164     NA's   :164    
##    live.alone      pub.assist      no.phone        no.plumb      married.kid   
##  Min.   :  0.0   Min.   : 0.0   Min.   :  0.0   Min.   :  0.0   Min.   :  0.0  
##  1st Qu.: 20.4   1st Qu.: 0.8   1st Qu.:  0.8   1st Qu.:  0.0   1st Qu.: 33.4  
##  Median : 27.1   Median : 1.8   Median :  1.7   Median :  0.9   Median : 40.7  
##  Mean   : 28.1   Mean   : 2.7   Mean   :  2.2   Mean   :  2.4   Mean   : 41.2  
##  3rd Qu.: 34.6   3rd Qu.: 3.6   3rd Qu.:  3.1   3rd Qu.:  3.0   3rd Qu.: 48.8  
##  Max.   :100.0   Max.   :53.5   Max.   :100.0   Max.   :100.0   Max.   :100.0  
##  NA's   :164     NA's   :171    NA's   :164     NA's   :163     NA's   :213    
##   hhd.no.comp    hhd.only.phone    hhd.no.int      hhd.broad    
##  Min.   :  0.0   Min.   :  0.0   Min.   :  0.0   Min.   :  0.0  
##  1st Qu.:  5.9   1st Qu.:  2.1   1st Qu.:  8.7   1st Qu.: 54.1  
##  Median : 10.9   Median :  4.7   Median : 15.6   Median : 68.0  
##  Mean   : 12.5   Mean   :  5.9   Mean   : 17.4   Mean   : 65.7  
##  3rd Qu.: 17.2   3rd Qu.:  8.4   3rd Qu.: 23.8   3rd Qu.: 79.4  
##  Max.   :100.0   Max.   :100.0   Max.   :100.0   Max.   :100.0  
##  NA's   :164     NA's   :164     NA's   :164     NA's   :164    
##   index_apr20   
##  Min.   :0.360  
##  1st Qu.:0.855  
##  Median :0.885  
##  Mean   :0.883  
##  3rd Qu.:0.916  
##  Max.   :1.194  
## 

Handle Null Values

We confirm that there are null values in the data set. To remove rows with null values in the columns listed below, we use the "na.omit" function. This function removes rows with any missing values in the specified columns. After removing the rows with null values, the resulting dataset would only contain rows with complete data in the specified columns.


colSums(is.na(num_df))
num_df <- na.omit(num_df)
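A toy illustration of what these two calls do (the `toy` data frame is ours, for demonstration only):

```r
# colSums(is.na()) counts missing values per column; na.omit() then drops
# every row containing at least one NA
toy <- data.frame(a = c(1, NA, 3), b = c(4, 5, NA))
colSums(is.na(toy))   # one NA in each column
clean <- na.omit(toy)
nrow(clean)           # only the first row is complete
```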

Handle Outlier

Next, we handle outliers in the numerical data. We apply the outlierKD2 function from the ezids package to detect and remove outliers, looping through every column: for each column we call outlierKD2, drop the resulting NAs, and update the data frame, then remove the two helper columns outlierKD2 appends. We believe handling outliers is essential to building an accurate model because outliers can strongly influence results, especially in regression models, and can lead to poor performance and inaccurate predictions.
new_num_df <- outlierKD2(num_df, num_df[[1]], rm=TRUE, boxplt=F, histogram=F,qqplt=F)
new_num_df <- new_num_df[,1:(ncol(new_num_df)-2)]

# loop through all columns
for (col_name in colnames(num_df)[2:ncol(num_df)]) {
  # remove outliers
  new_num_df <- outlierKD2(new_num_df, new_num_df[[col_name]], rm=T, boxplt=F, histogram=F,qqplt=F)
  new_num_df <- na.omit(new_num_df)
  new_num_df <- new_num_df[,1:(ncol(new_num_df)-2)]
}
new_num_df <- new_num_df[,1:(ncol(new_num_df)-1)]
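For readers without the ezids package: outlierKD2 flags values outside Tukey's 1.5 × IQR fences (to our understanding), and the per-column removal can be sketched in base R as follows (the function name and `toy` data are ours):

```r
# Remove rows whose value in `col` lies outside Tukey's 1.5*IQR fences
remove_outliers_iqr <- function(df, col) {
  q <- quantile(df[[col]], c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[2] - q[1]
  keep <- df[[col]] >= q[1] - 1.5 * iqr & df[[col]] <= q[2] + 1.5 * iqr
  df[keep & !is.na(keep), , drop = FALSE]
}

toy <- data.frame(x = c(1, 2, 3, 4, 100))  # 100 is an obvious outlier
nrow(remove_outliers_iqr(toy, "x"))        # 4
```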

Summary Statistics (after cleaning)

After cleaning, the new data set contains 57 rows and 57 columns. The following table shows the summary statistics after cleaning.

summary(new_num_df)

Boxplot Analysis

The box plots show the distribution of various health, demographic, and socioeconomic variables. They suggest that obesity, diabetes, and depression rates are relatively high, while education levels and median household income are moderate. The plots also reveal that rural areas have lower population densities and higher poverty rates than urban areas, and that the prevalence of single mothers, people living alone, and individuals receiving public assistance is higher in certain tracts.

# with outliers
# par(mfrow=c(8, 8))
# par(mar=c(1, 1, 1, 1))
# for (i in 1:ncol(num_df)) {
#   boxplot(num_df[[i]], main=names(num_df)[i])
# }

# without outliers
par(mfrow=c(8, 8))
par(mar=c(1, 1, 1, 1))
for (i in 1:ncol(new_num_df)) {
  boxplot(new_num_df[[i]], main=names(new_num_df)[i])
}

Scatter plot

We create scatter plots to visualize the relationship between depression and other variables, as well as poor mental health and other variables. The scatter plots show the distribution of each variable and the trend of the relationship between the variables. The data used in the plots come from a survey that collected information on various demographic, socioeconomic, and health-related factors.

Scatter plot for depression

Based on the scatter plot analysis, depression appears positively associated with OBESITY, PHLTH, and ea.hs.deg. There is a negative association between depression and MI_Estimate, ea.grad.prof.deg, CT_>60, and hhd.broad, indicating that as these variables increase, the depression rate decreases. The strength of these associations varies: OBESITY, PHLTH, and MI_Estimate show relatively strong correlations with depression (the last of these negative), while ea.hs.deg and ea.grad.prof.deg show weaker ones.

theme_set(theme_pubr(base_size = 3.5))
plot_list <- list()
for (col in names(new_num_df)) {
  if (col != "DEPRESSION") {
    plot_data <- data.frame(x = new_num_df[[col]], y = new_num_df$DEPRESSION)
    plot <- ggplot(plot_data, aes(x = x, y = y)) + 
      geom_point(size=0.3) +
      geom_smooth(method = "lm", se = FALSE) +
      xlab(col) +
      theme(legend.position = "none")
    plot_list[[col]] <- plot
  }
}
ggarrange(plotlist = plot_list, ncol = 7, nrow = 8)

Scatter plot for poor mental health

The scatter plots with a fitted line for poor mental health (MHLTH) and the above highly correlated variables show that MHLTH is positively correlated with PHLTH, OBESITY, and poverty. The scatter plot for PHLTH and MHLTH shows a clear positive linear relationship.

# Scatter plots with fit line for mhlth and other dependent variables
theme_set(theme_pubr(base_size = 3.5))
plot_list <- list()
for (col in names(new_num_df)) {
  if (col != "MHLTH") {
    plot_data <- data.frame(x = new_num_df[[col]], y = new_num_df$MHLTH)
    plot <- ggplot(plot_data, aes(x = x, y = y)) + 
      geom_point(size=0.3) +
      geom_smooth(method = "lm", se = FALSE) +
      xlab(col) +
      theme(legend.position = "none")
    plot_list[[col]] <- plot
  }
}
ggarrange(plotlist = plot_list, ncol = 7, nrow = 8)

Distribution of dependent variable

The purpose of this section is to determine the distribution of the dependent variables, depression rate and poor mental health rate, to confirm which model would be appropriate for further analysis. The results show that both depression rate and poor mental health rate are bell-shaped, indicating a normal distribution. This is confirmed through the use of histograms and qqplots, with both variables showing a close-to-straight-line pattern. The normal distribution of the dependent variables meets the assumption of linear regression, indicating that a linear regression model can be used for further analysis. This finding is significant as it allows for the exploration of the relationship between the dependent variables and the independent variables. The use of linear regression can provide valuable insight into potential risk factors for depression and poor mental health, allowing for the development of targeted interventions to improve mental health outcomes.

# Histograms
hist(new_num_df$DEPRESSION, main = "Distribution of Tract-level Depression Rates")

hist(new_num_df$MHLTH, main = "Distribution of Tract-level Poor Mental Health Rates")

# QQ plots for the distribution of tract-level depression rates and tract-level poor mental health rates.

qqnorm(new_num_df$DEPRESSION, main = "Distribution of Tract-level Depression Rates")
qqline(new_num_df$DEPRESSION)

qqnorm(new_num_df$MHLTH, main = "Distribution of Tract-level Poor Mental Health Rates")
qqline(new_num_df$MHLTH)
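The visual checks above can be complemented with a numeric test. A sketch using the Shapiro-Wilk test on simulated data (shapiro.test accepts at most 5,000 observations, so a large column such as DEPRESSION would need to be sampled first; the simulated `x` is a stand-in, not our data):

```r
# Shapiro-Wilk on a 5,000-point sample of simulated, roughly normal data;
# p-values above 0.05 are consistent with normality
set.seed(42)
x <- rnorm(60000, mean = 20, sd = 4)   # stand-in for a tract-level rate
s <- sample(x, 5000)
result <- shapiro.test(s)
result$p.value
```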

Correlations Test

We compute the correlation matrix of all numeric variables and keep the names of those most strongly correlated with each dependent variable, using |r| > 0.35 for DEPRESSION and |r| > 0.5 for MHLTH. Each dependent variable is then removed from the other's candidate list, and the list lengths are printed.

# Create a correlation matrix
cor_matrix <- cor(new_num_df)

# Create two lists of variable names highly correlated with each target
# (|r| > 0.35 for DEPRESSION, |r| > 0.5 for MHLTH)
high_dep_cor_list <- names(which(cor_matrix["DEPRESSION",] > 0.35 | cor_matrix["DEPRESSION",] < -0.35))
high_mhlth_cor_list <- names(which(cor_matrix["MHLTH",] > 0.5 | cor_matrix["MHLTH",] < -0.5))

high_dep_cor_list <- high_dep_cor_list[high_dep_cor_list != "MHLTH"]
high_mhlth_cor_list <- high_mhlth_cor_list[high_mhlth_cor_list != "DEPRESSION"]

length(high_dep_cor_list)  
length(high_mhlth_cor_list)
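The two-sided condition can also be written more compactly with abs(); a toy check (the `toy_cor` values are made up):

```r
# abs(r) > threshold is equivalent to r > threshold | r < -threshold
toy_cor <- c(DEPRESSION = 1, OBESITY = 0.6, land = -0.4, urban = 0.1)
names(which(abs(toy_cor) > 0.35))  # "DEPRESSION" "OBESITY" "land"
```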

Correlation Heatmap

We build correlation matrices restricted to the candidate variables identified above and visualize them with corrplot. The heatmaps show not only each variable's relationship with the target but also the correlations among the predictors themselves, which flag potential multicollinearity.

# Create two correlation matrixes from new_num_df in the above lists
high_dep_cor_mat <- cor(new_num_df[high_dep_cor_list])
high_mhlth_cor_mat <- cor(new_num_df[high_mhlth_cor_list])

high_dep_cor_mat
high_mhlth_cor_mat

# Plot the correlation matrices (corrplot takes base-graphics arguments;
# ggcorrplot-style options such as outline.color or legend.title do not apply)
corrplot(high_dep_cor_mat, type = "lower", tl.col = "black",
         title = "Highly Correlated Variables with DEPRESSION", mar = c(0, 0, 2, 0))

corrplot(high_mhlth_cor_mat, type = "lower", tl.col = "black",
         title = "Highly Correlated Variables with MHLTH", mar = c(0, 0, 2, 0))

my_col <- colorRampPalette(c("red", "white", "blue"))(30)

corrplot(high_dep_cor_mat, method = "number", type = "lower", order = "hclust", diag = FALSE, tl.col = "black", tl.cex = 0.8, cl.cex = 0.8, addCoef.col = "black", col = my_col, main = "", mar = c(0,0,0,0))

corrplot(high_mhlth_cor_mat, method = "number", type = "lower", order = "hclust", diag = FALSE, tl.col = "black", tl.cex = 0.8, cl.cex = 0.8, addCoef.col = "black", col = my_col, main = "", mar = c(0,0,0,0))

VIF Test for Multicollinearity

We compute the variance inflation factor (VIF) for each candidate predictor to quantify multicollinearity. As a rule of thumb, VIF values above roughly 5 to 10 indicate that a predictor is largely redundant with the others.

# Compute the VIF of each predictor: regress it on the remaining predictors
# and apply 1 / (1 - R^2). Backticks protect non-syntactic column names.
calculate_VIF <- function(data, target_col) {
  X <- data[, !colnames(data) %in% target_col, drop = FALSE]
  vif_values <- sapply(colnames(X), function(col) {
    others <- setdiff(colnames(X), col)
    f <- as.formula(paste0("`", col, "` ~ ", paste0("`", others, "`", collapse = " + ")))
    1 / (1 - summary(lm(f, data = X))$r.squared)
  })
  data.frame(Feature = colnames(X), VIF = vif_values)
}
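VIF is defined as 1/(1 − R²_j), where R²_j comes from regressing predictor j on the remaining predictors. A self-contained check of that formula on the built-in mtcars data (`vif_one` is our own helper name):

```r
# VIF by definition: regress each predictor on the others, then 1 / (1 - R^2)
preds <- mtcars[, c("wt", "hp", "disp")]
vif_one <- function(col, X) {
  f <- reformulate(setdiff(names(X), col), response = col)
  1 / (1 - summary(lm(f, data = X))$r.squared)
}
vifs <- sapply(names(preds), vif_one, X = preds)
vifs  # VIF is always at least 1; large values flag multicollinearity
```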

depression_VIF <- calculate_VIF(new_num_df[high_dep_cor_list], "DEPRESSION")

mhlth_VIF <- calculate_VIF(new_num_df[high_mhlth_cor_list], "MHLTH")

depression_VIF
mhlth_VIF

Feature Selection

We form the final feature list for each model by dropping the dependent variable itself from its candidate list.

dep_features_list <- high_dep_cor_list[high_dep_cor_list != "DEPRESSION"]
mhlth_features_list <- high_mhlth_cor_list[high_mhlth_cor_list != "MHLTH"]

dep_features_list
mhlth_features_list

Model Building

We now build predictive models for the two dependent variables, beginning with multiple linear regression and later a regression tree.

Training and Test Sets

We split the data into training and test sets by assigning each row a random fold from 1 to 10: fold 1 (roughly 10% of the rows) forms the test set and the remaining folds form the training set.

library(caret)
set.seed(123)
# Assign each row a random fold from 1 to 10
fold <- floor(runif(nrow(new_num_df), 1, 11))
new_num_df$fold <- fold

test.set <- new_num_df[new_num_df$fold == 1, ]
train.set <- new_num_df[new_num_df$fold != 1, ]
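The random-fold trick yields an approximate 90/10 split; an equivalent base-R sketch that draws the test indices directly (the `toy` data and variable names are ours):

```r
# 90/10 train/test split by sampling row indices
set.seed(123)
toy <- data.frame(y = rnorm(100), x = rnorm(100))
test_idx  <- sample(nrow(toy), size = floor(0.1 * nrow(toy)))
test_toy  <- toy[test_idx, ]
train_toy <- toy[-test_idx, ]
c(nrow(train_toy), nrow(test_toy))  # 90 and 10 rows
```

Sampling indices gives an exact 10% test set, whereas the fold assignment only hits 10% in expectation.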

Multiple Linear Regression

We fit multiple linear regression models on the training set, using the highly correlated variables selected above as predictors.

Model 1: With Highly correlated variables (Health Factors + Socioeconomic Factors)

Model 1 regresses each dependent variable on its highly correlated health and socioeconomic variables, then prints the model summaries and diagnostic plots.

# multiple linear regression model for depression
depression_model_1 <- lm(DEPRESSION ~ OBESITY + PHLTH + ea.hs.deg + ea.grad.prof.deg + MI_Estimate + `CT_>60` + hhd.broad, data = train.set)

# multiple linear regression model for poor mental health
mhlth_model_1 <- lm(MHLTH ~ LPA + OBESITY + PHLTH + SLEEP + ea.hs.deg + ea.ba.deg + ea.grad.prof.deg + MI_Estimate + poverty, data = train.set)

summary(depression_model_1)
summary(mhlth_model_1)

plot(depression_model_1)

plot(mhlth_model_1)

Model Comparison (ANOVA)

We compare the candidate models for each target with ANOVA F-tests, which assess whether the additional variables in the larger model significantly reduce the residual sum of squares.

# Compare the nested models with ANOVA F-tests
depression_anova_model <- anova(depression_model_1, depression_model_2, depression_model_3)
mhlth_anova_model <- anova(mhlth_model_1, mhlth_model_2, mhlth_model_3)


xkabledply(depression_anova_model)
xkabledply(mhlth_anova_model)

health.socioecon.dep.anova <- anova(depression_model_2, depression_model_3)
health.socioecon.mhlth.anova <- anova(mhlth_model_2, mhlth_model_3)

xkabledply(health.socioecon.dep.anova)
xkabledply(health.socioecon.mhlth.anova)

Stepwise Selection (Tuning Multiple Linear Regression Model)

We performed stepwise selection to identify an optimal combination of features. The information-criterion penalty accounts for the number of parameters in the model, which helps prevent overfitting and overly complex models. Note that passing k = log(n) to step() applies the BIC penalty, which is stricter than the default AIC penalty of k = 2.

# Perform stepwise selection; k = log(n) applies the BIC penalty
depression_stepwise_model <- step(depression_model_1, direction = "both", trace = 0, k = log(nrow(train.set)))

mhlth_stepwise_model <- step(mhlth_model_1, direction = "both", trace = 0, k = log(nrow(train.set)))

# Print the final selected model
summary(depression_stepwise_model)
summary(mhlth_stepwise_model)
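To illustrate how the k penalty drives step()'s choices, a self-contained sketch on the built-in mtcars data (an illustrative stand-in, not our project data):

```r
# Stepwise selection on mtcars: k = 2 is the AIC penalty,
# k = log(n) is the heavier BIC penalty
full <- lm(mpg ~ wt + hp + disp + qsec, data = mtcars)
aic_model <- step(full, direction = "both", trace = 0, k = 2)
bic_model <- step(full, direction = "both", trace = 0, k = log(nrow(mtcars)))

# The BIC-penalized search keeps no more terms than the AIC one here
length(coef(bic_model))
length(coef(aic_model))
```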

Best Model vs. Stepwise Model

The stepwise model had fewer variables yet showed similar performance to the initial model when compared with an ANOVA test: the residual sums of squares (RSS) for the depression and poor mental health models were almost identical. We therefore chose the stepwise model over the initial model because it is more parsimonious.

dep_anova_model_2 <- anova(depression_model_1, depression_stepwise_model)
mhlth_anova_model_2 <- anova(mhlth_model_1, mhlth_stepwise_model)

xkabledply(dep_anova_model_2)
xkabledply(mhlth_anova_model_2)

Best Model Performance

This section summarizes the models we consider the best performing based on the model selection carried out so far.

For depression rates, our model identified five significant predictor variables: obesity rate (OBESITY), poor physical health (PHLTH), high school graduate as highest educational attainment (ea.hs.deg), share of the population over 60 (CT_>60), and percentage of households with a broadband subscription (hhd.broad). The model explained 37.6% of the variability in depression rates, a moderate amount.

For poor mental health rates, our model identified seven significant predictor variables: LPA, OBESITY, PHLTH, SLEEP, ea.grad.prof.deg, MI_Estimate, and poverty. The model explained 67.6% of the variability in poor mental health rates, indicating a strong relationship between the predictors and poor mental health.

Overall, our findings suggest that demographic, socio-economic, and health-related variables play a significant role in predicting depression and poor mental health rates in the United States. Our models can serve as a basis for further research on this topic and can assist policymakers in identifying high-risk populations and designing targeted interventions.

dep_predictions <- predict(depression_stepwise_model, newdata = test.set)

# in-sample (training) R-squared of the chosen model
r_squared <- summary(depression_stepwise_model)$r.squared

# mean squared error (MSE) on the test set
mse <- mean((test.set$DEPRESSION - dep_predictions)^2)

# print the results
cat("R-squared:", r_squared, "\n")
cat("MSE:", mse, "\n")

mhlth_predictions <- predict(mhlth_stepwise_model, newdata = test.set)

# in-sample (training) R-squared of the chosen model
r_squared <- summary(mhlth_stepwise_model)$r.squared

# mean squared error (MSE) on the test set
mse <- mean((test.set$MHLTH - mhlth_predictions)^2)

# print the results
cat("R-squared:", r_squared, "\n")
cat("MSE:", mse, "\n")
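The evaluation code above is duplicated for the two models; a small helper (a sketch, `eval_model` is our own name) computes both metrics for any fitted model, illustrated here with a toy model on the built-in mtcars data:

```r
# Return test-set MSE and RMSE for a fitted model
eval_model <- function(model, newdata, actual) {
  pred <- predict(model, newdata = newdata)
  mse <- mean((actual - pred)^2)
  c(MSE = mse, RMSE = sqrt(mse))
}

fit <- lm(mpg ~ wt, data = mtcars)
metrics <- eval_model(fit, mtcars, mtcars$mpg)
metrics
```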

Regression Tree Model

We grow a regression tree for each dependent variable using recursive binary splitting on the training data, with all remaining variables as candidate predictors, then plot and summarize each tree.

# Building a Regression Tree

# Step 1. Use recursive binary splitting to grow a large tree of depression on the training data
# (all variables except the other target and the fold indicator are predictors)
train.dep.tree <- rpart(DEPRESSION ~ . - MHLTH - fold, data = train.set, method = "anova")

# Grow a large tree of mhlth on the training data
train.mhlth.tree <- rpart(MHLTH ~ . - DEPRESSION - fold, data = train.set, method = "anova")

# Plot tree model
fancyRpartPlot(train.dep.tree, main = "Depression Rate Tree")

fancyRpartPlot(train.mhlth.tree, main = "Poor Mental Health Rate Tree")

# Print summary
summary(train.dep.tree)
summary(train.mhlth.tree)

Regression Tree Performance

  • Evaluating the model: The model is evaluated using various metrics such as mean squared error (MSE), root mean squared error (RMSE), and R-squared.
  • Validating the model: The model is validated using the testing set to assess its ability to generalize to new data.
  • Tuning the model: The model is tuned by adjusting the hyperparameters to improve its performance.
  • Interpreting the model: The regression tree is interpreted to gain insights into the relationships between the independent variables and the dependent variable.
# Generate predicted values for test set
test.set$pred.depression <- predict(train.dep.tree, newdata = test.set)

# Calculate evaluation metrics
MSE <- mean((test.set$DEPRESSION - test.set$pred.depression)^2)
R2 <- 1 - sum((test.set$DEPRESSION - test.set$pred.depression)^2) / sum((test.set$DEPRESSION - mean(test.set$DEPRESSION))^2)

# Print evaluation metrics
cat("R-squared:", R2, "\n")
cat("MSE:", MSE, "\n")

# The R-squared value of 0.399 suggests that the regression tree model explains 39.9% of the variation in the data, which is a moderate level of explanation.
# The MSE (Mean Squared Error) of 6.05 represents the average squared difference between the predicted values and the actual values, with a higher value indicating worse performance.

test.set$pred.mhlth <- predict(train.mhlth.tree, newdata = test.set)
# Calculate evaluation metrics
MSE <- mean((test.set$MHLTH - test.set$pred.mhlth)^2)
R2 <- 1 - sum((test.set$MHLTH - test.set$pred.mhlth)^2) / sum((test.set$MHLTH - mean(test.set$MHLTH))^2)

# Print evaluation metrics
cat("R-squared:", R2, "\n")
cat("MSE:", MSE, "\n")

# The regression tree model with MHLTH as the target variable has an R-squared of 0.685,
# indicating that 68.5% of the variability in MHLTH is explained by the model.
# The mean squared error (MSE) is 1.07; its square root, the RMSE of about 1.04,
# means a typical prediction is off by roughly one unit.
# Overall, these metrics suggest the model has solid predictive power for MHLTH.
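The "tuning the model" step listed earlier is not shown in the chunks above. One standard approach for rpart is cost-complexity pruning, sketched here on built-in data (mtcars is a stand-in for our tract data):

```r
library(rpart)

# Grow a deliberately large tree, then prune back to the cp value that
# minimizes the cross-validated error in the cp table
set.seed(1)
big_tree <- rpart(mpg ~ ., data = mtcars, method = "anova",
                  control = rpart.control(cp = 0.001, minsplit = 5))
best_cp <- big_tree$cptable[which.min(big_tree$cptable[, "xerror"]), "CP"]
pruned  <- prune(big_tree, cp = best_cp)

# Pruning can only remove nodes, never add them
nrow(pruned$frame)
nrow(big_tree$frame)
```

The same pattern applied to train.dep.tree and train.mhlth.tree would guard against overfitting the training tracts.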

Result

Need edited

Discussion

Need edited

Limitations

Need edited

Further Research

Need edited

Conclusion

Need edited